This notebook applies different regression methods to the finalized dataset in search of the best regression model.

Data Ingestion and Wrangling


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
dataset_1min = pd.read_csv('dataset-1min.csv')
print(dataset_1min.shape)
dataset_1min.head(3)


(12824, 13)
Out[2]:
datetime temperature temperature_f humidity co2 light light_status noise bluetooth_devices image_hist_change door_status occupancy_count occupancy_category
0 2017-03-25 09:05:00 22.6 72.68 36.9 781.0 430.0 light-on 511.0 1.0 15.242697 closed 0.0 very-low
1 2017-03-25 09:05:00 22.6 72.68 36.9 781.0 430.0 light-on 511.0 1.0 15.242697 closed 0.0 very-low
2 2017-03-25 09:06:00 23.8 74.84 38.9 789.0 437.0 light-on 491.0 16.0 15.242697 closed 0.0 very-low

In [3]:
# Delete duplicate rows in the dataset
dataset_1min = dataset_1min.drop_duplicates()
print(dataset_1min.shape)
dataset_1min.head(3)


(8332, 13)
Out[3]:
datetime temperature temperature_f humidity co2 light light_status noise bluetooth_devices image_hist_change door_status occupancy_count occupancy_category
0 2017-03-25 09:05:00 22.60 72.68 36.9 781.0 430.000000 light-on 511.000000 1.0 15.242697 closed 0.0 very-low
2 2017-03-25 09:06:00 23.80 74.84 38.9 789.0 437.000000 light-on 491.000000 16.0 15.242697 closed 0.0 very-low
4 2017-03-25 09:07:00 23.85 74.93 38.8 766.0 421.653846 light-on 506.586957 25.0 15.242697 closed 15.0 low

In [4]:
# Subset the features needed 
names = ['temperature', 'humidity', 'co2', 'light', 'noise', 'bluetooth_devices','occupancy_count']
df = dataset_1min[names]
df.head(3)


Out[4]:
temperature humidity co2 light noise bluetooth_devices occupancy_count
0 22.60 36.9 781.0 430.000000 511.000000 1.0 0.0
2 23.80 38.9 789.0 437.000000 491.000000 16.0 0.0
4 23.85 38.8 766.0 421.653846 506.586957 25.0 15.0

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import scale

In [6]:
# Set occupancy_count as the dependent variable and others as independent variables
data = df.iloc[:,0:-1]
target = df.iloc[:,-1]

# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(data, target)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)


(6249, 6) (6249,)
(2083, 6) (2083,)

In [7]:
# Standardize the data
from sklearn.preprocessing import StandardScaler
standard_scaler = StandardScaler()
X_train = standard_scaler.fit_transform(X_train)
X_test = standard_scaler.transform(X_test)

Lasso Regression

Lasso regression is characterized by the tendency of its L1 regularization to push the weights of some variables to exactly zero. This is useful for our data, because some of our features are highly correlated while others are not useful for predicting occupancy.
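
As a quick illustration of this shrinkage behavior, the sketch below (assuming the standardized X_train and y_train from above; the alpha values are arbitrary) refits Lasso with increasingly strong regularization and counts how many coefficients are driven to exactly zero.


In [ ]:
from sklearn.linear_model import Lasso

# Illustrative sketch: as the regularization strength (alpha) grows,
# Lasso pushes more coefficients to exactly zero.
for a in [0.01, 0.1, 1.0]:
    coefs = Lasso(alpha=a).fit(X_train, y_train).coef_
    print("alpha=%.2f -> %d of %d coefficients are exactly zero"
          % (a, np.sum(coefs == 0), len(coefs)))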


In [8]:
# Select the optimal alpha using Yellowbrick
from yellowbrick.regressor import AlphaSelection
from yellowbrick.regressor import PredictionError, ResidualsPlot
from sklearn.linear_model import LassoCV
from sklearn.linear_model import Lasso

model = AlphaSelection(LassoCV())
model.fit(X_train, y_train)
model.poof()



In [9]:
lasso = Lasso(alpha = 0.078)
y_pred_lasso = lasso.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f" 
      % lasso.score(X_test, y_test))
print("Mean squared error: %.4f"
      % np.mean((y_test - y_pred_lasso) ** 2))


Test set R^2: 0.7486
Mean squared error: 49.2921

In [10]:
# Coefficients for each feature
pd.DataFrame(lasso.coef_, names[0:6])


Out[10]:
0
temperature -1.945128
humidity -1.263178
co2 10.626593
light -1.210172
noise 3.511164
bluetooth_devices -0.000000

In [11]:
# Plot Regressor evaluation for Lasso 
visualizer_res = ResidualsPlot(lasso)
visualizer_res.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer_res.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer_res.poof()



In [12]:
# Plot prediction error plot for Lasso 
visualizer_pre = PredictionError(lasso)
visualizer_pre.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer_pre.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer_pre.poof()


"A residuals plot shows the residuals on the vertical axis and the independent variable on the horizontal axis. If the points are randomly dispersed around the horizontal axis, a linear regression model is appropriate for the data; otherwise, a non-linear model is more appropriate." (http://www.scikit-yb.org/en/latest/examples/methods.html#regressor-evaluation)

Based on the residual plot for the Lasso model, a linear model might not be appropriate for our data.
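
For reference, the same kind of plot can be produced by hand. Below is a minimal sketch (assuming y_test and y_pred_lasso from the cells above) that plots the Lasso test-set residuals against the predicted values:


In [ ]:
# Plot test-set residuals against predicted values for the Lasso model
residuals_lasso = y_test - y_pred_lasso
plt.scatter(y_pred_lasso, residuals_lasso, alpha=0.3)
plt.axhline(0, color='k', linestyle='--')
plt.xlabel('Predicted occupancy_count')
plt.ylabel('Residual')
plt.title('Lasso residuals (test set)')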

Elastic Net

Elastic net regression is a hybrid approach that blends penalization of both the L2 and L1 norms. Specifically, elastic net regression minimizes the following:

$$\|y - X\beta\| + \lambda\left[(1-\alpha)\|\beta\|_2^2 + \alpha\|\beta\|_1\right]$$

The α hyper-parameter lies between 0 and 1 and controls how much L2 or L1 penalization is used (0 is ridge, 1 is lasso). The aggressiveness of the penalty for overfitting is controlled by the parameter λ. The usual approach to optimizing λ is through cross-validation, by minimizing the cross-validated mean squared prediction error, but in elastic net regression the optimal λ is heavily dependent on the α hyper-parameter. (http://www.onthelambda.com/2015/08/19/kickin-it-with-elastic-net-regression/)
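
Note that scikit-learn's ElasticNet uses alpha for the λ above and l1_ratio for the α above. As a minimal sketch (the l1_ratio grid below is an arbitrary illustrative choice; the standardized training data from above is assumed), ElasticNetCV can cross-validate over both at once:


In [ ]:
from sklearn.linear_model import ElasticNetCV

# Sketch: search the penalty strength (alpha, i.e. λ above) and the
# L1/L2 mix (l1_ratio, i.e. α above) jointly; the grid is illustrative.
encv = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5)
encv.fit(X_train, y_train)
print("best l1_ratio:", encv.l1_ratio_)
print("best alpha: %.4f" % encv.alpha_)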


In [13]:
from sklearn.linear_model import ElasticNetCV
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold 
from sklearn.model_selection import cross_val_score
from yellowbrick.regressor import AlphaSelection
from yellowbrick.regressor import PredictionError, ResidualsPlot

In [14]:
# Hyperparameter tuning with yellowbrick
model = AlphaSelection(ElasticNetCV())
model.fit(X_train, y_train)
model.poof()



In [15]:
# Fit Elastic Net model using standardized data
elastic = ElasticNet(alpha = 0.022)
y_pred_elastic = elastic.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f" 
      % elastic.score(X_test, y_test))
print("Mean squared error: %.4f"
      % np.mean((y_test - y_pred_elastic) ** 2))


Test set R^2: 0.7487
Mean squared error: 49.2805

In [16]:
# Coefficients for each feature
pd.DataFrame(elastic.coef_, names[0:6])


Out[16]:
0
temperature -1.953304
humidity -1.363251
co2 10.482002
light -1.314821
noise 3.464967
bluetooth_devices -0.003796

In [17]:
# Regressor evaluation using Yellowbrick
visualizer_res = ResidualsPlot(elastic)
visualizer_res.fit(X_train, y_train)  # Fit the training data to the visualizer
visualizer_res.score(X_test, y_test)  # Evaluate the model on the test data
g = visualizer_res.poof()



In [18]:
# Instantiate the visualizer and fit
visualizer = PredictionError(elastic)
visualizer.fit(X_train, y_train)
visualizer.score(X_test, y_test)
visualizer.poof()


Gradient Boosting Regression


In [21]:
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from yellowbrick.regressor import PredictionError, ResidualsPlot

In [31]:
clf = GradientBoostingRegressor(learning_rate=0.1, random_state=0)
parameters = {'max_depth': [7, 8, 9], 'n_estimators': [50, 100, 150]}
gs = GridSearchCV(clf, parameters, cv=5)
gs.fit(X_train, y_train)
pd.DataFrame(gs.cv_results_)[['params', 'mean_test_score', 'std_test_score']]


Out[31]:
                                  params  mean_test_score  std_test_score
0   {'max_depth': 7, 'n_estimators': 50}          0.94939         0.00608
1  {'max_depth': 7, 'n_estimators': 100}          0.95477         0.00566
2  {'max_depth': 7, 'n_estimators': 150}          0.95651         0.00591
3   {'max_depth': 8, 'n_estimators': 50}          0.95214         0.00648
4  {'max_depth': 8, 'n_estimators': 100}          0.95651         0.00696
5  {'max_depth': 8, 'n_estimators': 150}          0.95731         0.00688
6   {'max_depth': 9, 'n_estimators': 50}          0.95504         0.00626
7  {'max_depth': 9, 'n_estimators': 100}          0.95791         0.00732
8  {'max_depth': 9, 'n_estimators': 150}          0.95837         0.00743

In [32]:
gs.best_params_


Out[32]:
{'max_depth': 9, 'n_estimators': 150}

In [23]:
gbr = GradientBoostingRegressor(max_depth=9, n_estimators=150)
y_pred_gbr = gbr.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f" 
      % gbr.score(X_test, y_test))
print("Mean squared error: %.4f"
      % np.mean((y_test - y_pred_gbr) ** 2))


Test set R^2: 0.9667
Mean squared error: 6.5392

In [35]:
# Plot feature importances for the gradient boosting regressor
params = names[0:6]
feature_importance = gbr.feature_importances_
sorted_features = sorted(zip(feature_importance, params))
importances, params_sorted = zip(*sorted_features)
plt.barh(range(len(params)), importances, align='center', alpha=0.6, color='g')
plt.tick_params(axis='y', which='both', labelleft=False, labelright=True)

plt.yticks(range(len(params)), params_sorted, fontsize=12)
plt.xlabel('Mean Importance', fontsize=12)
plt.title('Mean feature importances\n for gradient boosting regressor')



In [25]:
# Plot Regressor evaluation for GradientBoostingRegressor
visualizer_res = ResidualsPlot(gbr)
visualizer_res.fit(X_train, y_train) 
visualizer_res.score(X_test, y_test)  
g = visualizer_res.poof()



In [26]:
# Plot prediction error plot for GradientBoostingRegressor
visualizer_pre = PredictionError(gbr)
visualizer_pre.fit(X_train, y_train)  
visualizer_pre.score(X_test, y_test)  
g = visualizer_pre.poof()


Random Forest Regressor


In [27]:
from sklearn.ensemble import RandomForestRegressor
from yellowbrick.regressor import PredictionError, ResidualsPlot

In [33]:
rf = RandomForestRegressor(max_features = 'log2')
y_pred_rf = rf.fit(X_train, y_train).predict(X_test)
print("Test set R^2: %.4f" 
      % rf.score(X_test, y_test))
print("Mean squared error: %.4f"
      % np.mean((y_test - y_pred_rf) ** 2))


Test set R^2: 0.9586
Mean squared error: 8.1097

In [37]:
# Plot feature importances for the random forest regressor
params = names[0:6]
feature_importance = rf.feature_importances_
sorted_features = sorted(zip(feature_importance, params))
importances, params_sorted = zip(*sorted_features)
plt.barh(range(len(params)), importances, align='center', alpha=0.6, color='g')
plt.tick_params(axis='y', which='both', labelleft=False, labelright=True)

plt.yticks(range(len(params)), params_sorted, fontsize=12)
plt.xlabel('Mean Importance', fontsize=12)
plt.title('Mean feature importances\n for random forest regressor')



In [30]:
# Plot regressor evaluation for RandomForestRegressor using Yellowbrick
visualizer_res = ResidualsPlot(rf)
visualizer_res.fit(X_train, y_train) 
visualizer_res.score(X_test, y_test)  
g = visualizer_res.poof()



In [36]:
# Plot prediction error plot for RandomForestRegressor
visualizer_pre = PredictionError(rf)
visualizer_pre.fit(X_train, y_train)  
visualizer_pre.score(X_test, y_test)  
g = visualizer_pre.poof()


Conclusions

  1. Based on the residual plots, linear models such as Lasso and ElasticNet do not describe our data well (poor performance on both the training and test data). Non-linear models (Random Forest, Gradient Boosting Regression) perform better, at least in describing the training data.

  2. Based on the R^2 score and mean squared error, Gradient Boosting Regression has the best predictive performance among the four models tested.

  3. CO2 demonstrates the highest importance in Lasso, ElasticNet, and the Random Forest Regressor. For Gradient Boosting Regression, though, light is the most important feature. Noise and temperature have little predictive value in our case, largely due to restrictions in the data collection process (stable temperature in the classroom, low variability in noise).

  4. I'm surprised by how high the prediction scores are for all of the models tested above (an R^2 above 0.70 usually indicates a very good model). My guess is that certain features (e.g., CO2) are highly correlated with occupancy and give the model clear "hints" about it; a quick correlation check follows below. (What do you think?)
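
A quick sanity check of that last hypothesis, as a minimal sketch using the feature subset df created earlier:


In [ ]:
# Pearson correlation of each feature with occupancy_count,
# to check whether CO2 really is the strongest single linear predictor.
df.corr()['occupancy_count'].drop('occupancy_count').sort_values(ascending=False)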

